Empiric Introduction to Light Stochastic Binarization
Abstract
We introduce a novel method for transforming texts into short binary vectors that can subsequently be compared by Hamming distance. Similarly to other semantic hashing approaches, the objective is radical dimensionality reduction: texts with similar meaning are placed into the same or similar buckets, while texts with dissimilar meaning are placed into different and distant buckets. The method first transforms the texts into a complete TF-IDF representation, then applies Reflective Random Indexing to fold both the term and document spaces into a low-dimensional space. Subsequently, every dimension of the resulting low-dimensional space is thresholded at its 50th percentile, so that every individual bit of the resulting hash cuts the whole input dataset into two subsets of equal cardinality. Without any parameter-tuning training phase whatsoever, the method attains results on the 20newsgroups text classification task, especially in the high-precision/low-recall region, that are comparable to those obtained by much more complex deep learning techniques.
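The pipeline described in the abstract (TF-IDF weighting, Reflective Random Indexing, median thresholding, Hamming comparison) can be illustrated with a minimal sketch in pure Python. This is not the authors' implementation. The function names, the whitespace tokenizer, the `tf * log(N/df)` weighting, the choice to seed the process with sparse random term vectors, and the defaults (`dim`, `seeds`, `iterations`) are all assumptions made for illustration.

```python
import math
import random
import statistics
from collections import Counter


def tfidf(docs):
    """Return (vocab, per-document {term: tf * log(N / df)}). Assumed weighting."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    weights = [
        {t: c * math.log(n / df[t]) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]
    return sorted(df), weights


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def project_docs(weights, term_vecs, dim):
    """Each document vector is the weighted sum of its term vectors."""
    out = []
    for w in weights:
        v = [0.0] * dim
        for t, wt in w.items():
            for i, x in enumerate(term_vecs[t]):
                v[i] += wt * x
        out.append(normalize(v))
    return out


def project_terms(weights, doc_vecs, vocab, dim):
    """Each term vector is the weighted sum of the documents containing it."""
    acc = {t: [0.0] * dim for t in vocab}
    for w, dv in zip(weights, doc_vecs):
        for t, wt in w.items():
            for i, x in enumerate(dv):
                acc[t][i] += wt * x
    return {t: normalize(v) for t, v in acc.items()}


def binarize_at_median(doc_vecs):
    """Set bit i of a document's hash when its i-th coordinate exceeds the median."""
    hashes = [0] * len(doc_vecs)
    for i in range(len(doc_vecs[0])):
        column = [v[i] for v in doc_vecs]
        threshold = statistics.median(column)  # the 50th percentile
        for d, x in enumerate(column):
            if x > threshold:
                hashes[d] |= 1 << i
    return hashes


def binary_hashes(docs, dim=16, seeds=4, iterations=2, seed=0):
    """Hypothetical end-to-end helper: texts -> integer-encoded binary hashes."""
    rng = random.Random(seed)
    vocab, weights = tfidf(docs)
    # Assumption: start from sparse random term vectors with a few +1/-1 entries.
    term_vecs = {}
    for t in vocab:
        v = [0.0] * dim
        for pos in rng.sample(range(dim), seeds):
            v[pos] = rng.choice((-1.0, 1.0))
        term_vecs[t] = v
    # "Reflection": alternate between the document space and the term space.
    for _ in range(iterations):
        doc_vecs = project_docs(weights, term_vecs, dim)
        term_vecs = project_terms(weights, doc_vecs, vocab, dim)
    return binarize_at_median(project_docs(weights, term_vecs, dim))


def hamming(a, b):
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")
```

Calling `binary_hashes(list_of_texts)` returns one integer per text, and `hamming(h1, h2)` compares two of them. When coordinate values tie at the median, a bit splits the dataset only approximately in half, so the equal-cardinality property holds exactly only for columns without such ties.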
Similar resources
Better Synchronous Binarization for Machine Translation
Binarization of Synchronous Context Free Grammars (SCFG) is essential for achieving polynomial time complexity of decoding for SCFG parsing based machine translation systems. In this paper, we first investigate the excess edge competition issue caused by a left-heavy binary SCFG derived with the method of Zhang et al. (2006). Then we propose a new binarization method to mitigate the problem by e...
LCFRS binarization and debinarization for directional parsing
In data-driven parsing with Linear Context-Free Rewriting System (LCFRS), markovized grammars are obtained through the annotation of binarization non-terminals during grammar binarization, as in the corresponding work on PCFG parsing. Since there is indication that directional parsing with a non-binary LCFRS can be faster than parsing with a binary LCFRS, we present a debinarization procedure w...
Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization
It has recently been shown that different NLP models can be effectively combined using dual decomposition. In this paper we demonstrate that PCFG-LA parsing models are suitable for combination in this way. We experiment with the different models which result from alternative methods of extracting a grammar from a treebank (retaining or discarding function labels, left binarization versus right ...
Extreme Value Theory Based Text Binarization In Documents and Natural Scenes
This paper presents a novel image binarization method that can deal with degradations such as shadows, nonuniform illumination, low-contrast, large signal-dependent noise, smear and strain. A pre-processing procedure based on morphological operations is first applied to suppress light/dark structures connected to image border. A novel binarization concept based on difference of gamma functions ...
Using Physically Based Rendering to Benchmark Structured Light Scanners: Appendix
Although the simulator generates realistic images, we need to verify that the range scans we produce using the synthetic scanner contain artifacts similar to those acquired in real scans. A common operation performed when using binary coded patterns is the binarization process. This identifies each pixel as either lit or unlit. Our validation procedure, which is based on this binarization opera...
Publication date: 2014